Entry Pairing in Inverted File
Abstract
This paper proposes exploiting content and usage information to rearrange the inverted index of a full-text IR system. The idea is to merge the entries of two terms that frequently co-occur, either in the collection or in the answered queries, into a single paired entry. Since postings common to the paired terms are not replicated, the resulting index is more compact. In addition, queries containing paired terms are answered faster, since they can exploit the pre-computed posting intersection. To choose which terms to pair, we formulate term pairing as a Maximum-Weight Matching problem on a graph, and we evaluate the efficiency and efficacy of both an exact and a heuristic solution in our scenario. We apply our technique (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list. Experiments showed that in the first case our approach improves the compression ratio by up to 7.7%, while in the second we measured a saving of 12% to 18% in the size of the posting cache.
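The pairing idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract does not specify the paired-entry layout, the edge weights, or the heuristic used, so the code below weights each candidate pair by the size of its posting intersection, applies a simple greedy matching in place of the paper's exact and heuristic solvers, and stores a paired entry as three disjoint lists. The names `greedy_pairing`, `build_paired_entry`, and `lookup` are hypothetical.

```python
from itertools import combinations


def greedy_pairing(postings):
    """Greedy heuristic for maximum-weight matching (an assumed stand-in,
    not the paper's algorithm). Edge weight = number of shared postings."""
    edges = []
    for a, b in combinations(sorted(postings), 2):
        weight = len(postings[a] & postings[b])
        if weight > 0:
            edges.append((weight, a, b))
    # Heaviest edges first; ties broken alphabetically for determinism.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    matched, pairs = set(), []
    for _, a, b in edges:
        # Each term may belong to at most one pair (matching constraint).
        if a not in matched and b not in matched:
            pairs.append((a, b))
            matched.update((a, b))
    return pairs


def build_paired_entry(postings_a, postings_b):
    """Assumed layout: shared postings are stored once, in 'both'."""
    return {
        "only_a": sorted(postings_a - postings_b),
        "only_b": sorted(postings_b - postings_a),
        "both": sorted(postings_a & postings_b),
    }


def lookup(entry, which):
    """Return postings for term 'a', term 'b', or the conjunction 'and'."""
    if which == "and":
        return entry["both"]  # pre-computed intersection, no merge needed
    return sorted(entry["only_" + which] + entry["both"])


# Toy collection: term -> set of document ids (made-up data).
postings = {
    "new": {1, 2, 3, 5, 8},
    "york": {2, 3, 5, 9},
    "index": {1, 4, 7},
    "file": {4, 7, 9},
}

pairs = greedy_pairing(postings)
entry = build_paired_entry(postings["new"], postings["york"])
stored = sum(len(v) for v in entry.values())
separate = len(postings["new"]) + len(postings["york"])
print(pairs, entry, stored, separate)
```

In this toy collection the paired entry for "new" and "york" holds 6 postings instead of the 9 needed for two separate lists, and a conjunctive query on both terms reads the `both` list directly. A real index would also have to account for compression and for the cost of reconstructing a single term's list, which this sketch ignores.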
Similar resources
An Inverted File Cache for Fast Information Retrieval
The inverted file is the most popular indexing mechanism used for document search in an information retrieval system (IRS). However, the disk I/O for accessing the inverted file becomes a bottleneck in an IRS. To avoid using the disk I/O, we propose a caching mechanism for accessing the inverted file, called the inverted file cache (IF cache). In this cache, a proposed hashing scheme using a li...
Microsoft Research at TREC 2011 Web Track
This paper describes our entry into the TREC 2011 Web track. We extracted and ranked results from the ClueWeb09 corpus using a parallel processing pipeline that avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline, how we ran the TREC experiments, and we present effectiveness results.
Structural Basis of Transcription: Nucleotide Selection by Rotation in the RNA Polymerase II Active Center
Binding of a ribonucleoside triphosphate to an RNA polymerase II transcribing complex, with base pairing to the template DNA, was revealed by X-ray crystallography. Binding of a mismatched nucleoside triphosphate was also detected, but in an adjacent site, inverted with respect to the correctly paired nucleotide. The results are consistent with a two-step mechanism of nucleotide selection, with...
CLIP: A Compact, Load-balancing Index Placement Function
Existing file searching tools do not have the performance or accuracy that search engines have. This is especially a problem in large-scale distributed file systems, where better-performing file searching tools are much needed for enterprise-level systems. Search engines use inverted indices to store terms and other metadata. Although some desktop file searching tools use indices to store file ...
On-Line Generation of Association Rules Using Inverted File Indexing and Compression
We consider the problem of online mining of association rules in databases containing large numbers of transactions, especially in market-basket data. We focus primarily on the use of a proper data structure that would allow us to handle large amounts of data for online mining of association rules, and also on presenting and answering new types of online queries. We borrow techni...
Publication date: 2009